AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

# Install the uszipcode package to handle the zipcodes
!pip install uszipcode
Requirement already satisfied: uszipcode in c:\app\anaconda3\lib\site-packages (1.0.1)
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# %matplotlib inline
# Libraries to split the data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import Image
from os import system
# To tune different models
from sklearn.model_selection import GridSearchCV
from uszipcode import SearchEngine, SimpleZipcode
search = SearchEngine()
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
#from google.colab import drive
#drive.mount("/content/drive")
#data =pd.read_csv("/content/drive/MyDrive/Machine Learning/Loan_Modelling.csv")
data =pd.read_csv("C:/Users/jkama/Downloads/Loan_Modelling.csv")
# Copying data to another variable to avoid any changes to original data
loan_data = data.copy()
View the first and last 5 rows of the data set
loan_data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
loan_data.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Check the shape of the dataset
loan_data.shape
(5000, 14)
loan_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
Observations -
- Personal_Loan is the dependent (target) variable - int64
- All the independent variables are numeric (integer or float)
loan_data.describe(include='all').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Observations-
- The average customer age is 45, and ages range from 23 to 67
- The average income is 73K, and the 75th-percentile income is 98K
- The average family size is 2, with a range of 1 to 4
- The average customer has graduate-level education; the range runs from undergraduate to Advanced/Professional
- Experience in negative numbers does not make sense; this needs a closer data check
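Before deciding how to treat them, the negative entries can be counted and inspected. A minimal sketch on a toy frame (the column names match the dataset; the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for loan_data (illustrative values only)
toy = pd.DataFrame({"Age": [24, 23, 45, 60], "Experience": [-3, -2, 20, 35]})

# Count and inspect rows with an impossible negative Experience
neg_mask = toy["Experience"] < 0
print(neg_mask.sum())            # number of affected rows
print(toy.loc[neg_mask, "Age"])  # the affected customers are all young
```

The same mask applied to the real data (`loan_data["Experience"] < 0`) shows the negative values occur only for customers in their early twenties, which motivates the zero-replacement below.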
# Negative Experience values do not make sense; inspect them
loan_data.sort_values(by=["Experience"],ascending=True).head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 4515 | 24 | -3 | 41 | 91768 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2618 | 2619 | 23 | -3 | 55 | 92704 | 3 | 2.4 | 2 | 145 | 0 | 0 | 0 | 1 | 0 |
| 4285 | 4286 | 23 | -3 | 149 | 93555 | 2 | 7.2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3626 | 3627 | 24 | -3 | 28 | 90089 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3796 | 3797 | 24 | -2 | 50 | 94920 | 3 | 2.4 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2717 | 2718 | 23 | -2 | 45 | 95422 | 4 | 0.6 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4481 | 4482 | 25 | -2 | 35 | 95045 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3887 | 3888 | 24 | -2 | 118 | 92634 | 2 | 7.2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2876 | 2877 | 24 | -2 | 80 | 91107 | 2 | 1.6 | 3 | 238 | 0 | 0 | 0 | 0 | 0 |
| 2962 | 2963 | 23 | -2 | 81 | 91711 | 2 | 1.8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let us replace the negative values with zero, because the observed
# Experience values in this age group are mostly zero
loan_data['Experience'].replace([-3, -2, -1], 0, inplace=True)
# Check if the ID column is unique
loan_data["ID"].nunique()
5000
# Since ID is unique for every row, it is not useful for prediction. Let us drop it
loan_data.drop("ID",axis=1, inplace=True)
# Check if there are duplicates in the dataset after dropping ID
loan_data.duplicated().sum()
0
print(loan_data.Age.value_counts())
print(loan_data.Experience.value_counts())
print(loan_data.Income.value_counts())
print(loan_data.ZIPCode.value_counts())
35 151
43 149
52 145
54 143
58 143
50 138
41 136
30 136
56 135
34 134
39 133
57 132
59 132
51 129
45 127
60 127
46 127
42 126
31 125
40 125
55 125
29 123
62 123
61 122
44 121
32 120
33 120
48 118
38 115
49 115
47 113
53 112
63 108
36 107
37 106
28 103
27 91
65 80
64 78
26 78
25 53
24 28
66 24
67 12
23 12
Name: Age, dtype: int64
32 154
20 148
9 147
5 146
23 144
35 143
25 142
28 138
18 137
19 135
26 134
24 131
3 129
16 127
14 127
30 126
17 125
27 125
34 125
22 124
29 124
7 121
6 119
8 119
15 119
10 118
0 118
33 117
13 117
11 116
37 116
36 114
4 113
21 113
31 104
12 102
38 88
2 85
39 85
1 74
40 57
41 43
42 8
43 3
Name: Experience, dtype: int64
44 85
38 84
81 83
41 82
39 81
..
202 2
203 2
189 2
224 1
218 1
Name: Income, Length: 162, dtype: int64
94720 169
94305 127
95616 116
90095 71
93106 57
...
96145 1
94087 1
91024 1
93077 1
94598 1
Name: ZIPCode, Length: 467, dtype: int64
loan_data['ZIPCode'].info()
<class 'pandas.core.series.Series'> RangeIndex: 5000 entries, 0 to 4999 Series name: ZIPCode Non-Null Count Dtype -------------- ----- 5000 non-null int64 dtypes: int64(1) memory usage: 39.2 KB
While analyzing the ZIP code values, a few invalid ZIP codes were noticed. Below we replace the invalid ZIP codes with valid nearby ones.
# Replace each invalid zipcode with a nearby valid one
loan_data['ZIPCode'].replace(92717,92707,inplace=True)
loan_data['ZIPCode'].replace(93077,93007,inplace=True)
loan_data['ZIPCode'].replace(96651,95651,inplace=True)
loan_data['ZIPCode'].replace(92634,92834,inplace=True)
# Map each zipcode to its city, county and state, to reduce the number of unique values
import uszipcode
search = SearchEngine()

# Shared lookup helper: returns the requested attribute for a zipcode, or None
def zip_lookup(x, attr):
    if pd.isnull(x):
        return None
    result = search.by_zipcode(x)
    return getattr(result, attr) if result else None

# Function to return the city
def zto(x):
    return zip_lookup(x, "major_city")

# Function to return the county
def zco(x):
    return zip_lookup(x, "county")

# Function to return the state
def zso(x):
    return zip_lookup(x, "state")
# Create the new data columns for City, County and State
loan_data['City'] = loan_data['ZIPCode'].apply(zto)
loan_data['County'] = loan_data['ZIPCode'].apply(zco)
loan_data['State'] = loan_data['ZIPCode'].apply(zso)
# We can also drop ZIPCode, since we have derived City, County and State columns from it
loan_data.drop("ZIPCode", axis=1, inplace=True)
# Check the number of unique Cities, Counties and States and their counts
print(f"Unique cities: {loan_data['City'].nunique()}")
print(loan_data["City"].value_counts())
print("-" * 50, '\n')
print(f"Unique Counties: {loan_data['County'].nunique()}")
print(loan_data["County"].value_counts())
print("-" * 50, '\n')
print(f"Unique State: {loan_data['State'].nunique()}")
loan_data["State"].value_counts()
Unique cities: 245
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
...
Sausalito 1
Ladera Ranch 1
Sierra Madre 1
Tahoe City 1
Stinson Beach 1
Name: City, Length: 245, dtype: int64
--------------------------------------------------
Unique Counties: 38
Los Angeles County 1095
San Diego County 568
Santa Clara County 563
Alameda County 500
Orange County 366
San Francisco County 257
San Mateo County 204
Sacramento County 184
Santa Barbara County 154
Yolo County 130
Monterey County 128
Ventura County 115
San Bernardino County 101
Contra Costa County 85
Santa Cruz County 68
Riverside County 56
Marin County 54
Kern County 54
San Luis Obispo County 33
Solano County 33
Humboldt County 32
Sonoma County 28
Fresno County 26
Placer County 24
El Dorado County 23
Butte County 19
Shasta County 18
Stanislaus County 15
San Benito County 14
San Joaquin County 13
Mendocino County 8
Siskiyou County 7
Tuolumne County 7
Merced County 4
Trinity County 4
Lake County 4
Imperial County 3
Napa County 3
Name: County, dtype: int64
--------------------------------------------------
Unique State: 1
CA 5000 Name: State, dtype: int64
# The State column holds a single constant value (CA). This feature won't be useful for predicting the target variable, as it provides no useful insight into the data.
loan_data.drop("State",axis=1, inplace=True)
# City has too many unique values; as a high-cardinality feature it adds little value, so we drop it
loan_data.drop("City",axis=1, inplace=True)
loan_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 5000 non-null int64 1 Experience 5000 non-null int64 2 Income 5000 non-null int64 3 Family 5000 non-null int64 4 CCAvg 5000 non-null float64 5 Education 5000 non-null int64 6 Mortgage 5000 non-null int64 7 Personal_Loan 5000 non-null int64 8 Securities_Account 5000 non-null int64 9 CD_Account 5000 non-null int64 10 Online 5000 non-null int64 11 CreditCard 5000 non-null int64 12 County 5000 non-null object dtypes: float64(1), int64(11), object(1) memory usage: 507.9+ KB
# An AgeGroup feature could be created since Age has many unique values (left commented out; not used further)
#loan_data["AgeGroup"]=pd.cut(loan_data['Age'], bins=[23, 29, 35, 41,47,53,59,65], right=True, labels=[1, 2, 3,4,5,6,7])
#print(loan_data["AgeGroup"].value_counts())
# Let us create dummy variables for the categorical column County
oneHotCols=["County"]
loan_data=pd.get_dummies(loan_data, columns=oneHotCols)
loan_data.head(25)
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | ... | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 37 | 13 | 29 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 53 | 27 | 72 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 50 | 24 | 22 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 35 | 10 | 81 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 34 | 9 | 180 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 10 | 65 | 39 | 105 | 4 | 2.4 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 29 | 5 | 45 | 3 | 0.1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 48 | 23 | 114 | 2 | 3.8 | 3 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 59 | 32 | 40 | 4 | 2.5 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 67 | 41 | 112 | 1 | 2.0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 60 | 30 | 22 | 1 | 1.5 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 38 | 14 | 130 | 4 | 4.7 | 3 | 134 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 42 | 18 | 81 | 4 | 2.4 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | 46 | 21 | 193 | 2 | 8.1 | 3 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 55 | 28 | 21 | 1 | 0.5 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | 56 | 31 | 25 | 4 | 0.9 | 2 | 111 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 57 | 27 | 63 | 3 | 2.0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22 | 29 | 5 | 62 | 1 | 1.2 | 1 | 260 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 44 | 18 | 43 | 2 | 0.7 | 1 | 163 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 24 | 36 | 11 | 152 | 2 | 3.9 | 1 | 159 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25 rows × 50 columns
Questions:
Answer:
- Most customers do not have a mortgage, and Mortgage has many outliers
- 1,470 customers have credit cards from other banks
- Income, Education, and Family have a strong correlation with the target attribute
- Customers aged 30-61 purchase the loan more often; those below 30 and over 61 purchase it less often
- As education level increases, loan purchases also increase
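The correlation claim above can be checked with `DataFrame.corr()`; on the real data this would be `loan_data.corr()["Personal_Loan"]`. A minimal sketch on synthetic data (the numbers here are illustrative, not the bank's figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
income = rng.normal(74, 46, n)
# Synthetic target loosely driven by income, mimicking the observed pattern
target = (income + rng.normal(0, 30, n) > 120).astype(int)
df = pd.DataFrame({"Income": income, "Personal_Loan": target})

# Correlation of each feature with the target, strongest first
corr = df.corr()["Personal_Loan"].drop("Personal_Loan").sort_values(ascending=False)
print(corr)
```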
# function to plot a stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    # place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# function to create labeled bar plots
def labeled_barplots(data, feature, perc=False, n=None):
    total = len(data[feature])
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            # percentage of each class of the category
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # midpoint of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )
    # bins=None falls back to seaborn's automatic binning
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="-"
    )
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )
histogram_boxplot(loan_data,"Age",kde=True)
histogram_boxplot(loan_data,"Experience",kde=True)
histogram_boxplot(loan_data,"Income",kde=True)
labeled_barplots(loan_data,"Family",perc=True,n=10)
histogram_boxplot(loan_data,"CCAvg",kde=True)
labeled_barplots(loan_data,"Education",perc=True,n=10)
histogram_boxplot(loan_data,"Mortgage",kde=True)
labeled_barplots(loan_data,"Personal_Loan",perc=True,n=10)
labeled_barplots(loan_data,"Securities_Account",perc=True,n=10)
labeled_barplots(loan_data,"CD_Account",perc=True,n=10)
94% of the customers do not have a CD account, while 6% do
labeled_barplots(loan_data,"Online",perc=True,n=10)
60% of customers use online banking facilities, while 40% still do not
labeled_barplots(loan_data,"CreditCard",perc=True,n=10)
30% of the customers use another bank's credit card, while 70% do not
Age vs Personal Loan
stacked_barplot(loan_data, "Age", "Personal_Loan")
Personal_Loan 0 1 All Age All 4520 480 5000 34 116 18 134 30 119 17 136 36 91 16 107 63 92 16 108 35 135 16 151 33 105 15 120 52 130 15 145 29 108 15 123 54 128 15 143 43 134 15 149 42 112 14 126 56 121 14 135 65 66 14 80 44 107 14 121 50 125 13 138 45 114 13 127 46 114 13 127 26 65 13 78 32 108 12 120 57 120 12 132 38 103 12 115 27 79 12 91 48 106 12 118 61 110 12 122 53 101 11 112 51 119 10 129 60 117 10 127 58 133 10 143 49 105 10 115 47 103 10 113 59 123 9 132 28 94 9 103 62 114 9 123 55 116 9 125 64 70 8 78 41 128 8 136 40 117 8 125 37 98 8 106 31 118 7 125 39 127 6 133 24 28 0 28 25 53 0 53 66 24 0 24 67 12 0 12 23 12 0 12 ------------------------------------------------------------------------------------------------------------------------
Experience vs Personal Loan
stacked_barplot(loan_data, "Experience", "Personal_Loan")
Personal_Loan 0 1 All Experience All 4520 480 5000 9 127 20 147 8 101 18 119 3 112 17 129 20 131 17 148 12 86 16 102 5 132 14 146 32 140 14 154 26 120 14 134 25 128 14 142 19 121 14 135 16 114 13 127 37 103 13 116 35 130 13 143 30 113 13 126 23 131 13 144 22 111 13 124 11 103 13 116 31 92 12 104 36 102 12 114 6 107 12 119 18 125 12 137 7 109 12 121 29 112 12 124 28 127 11 138 17 114 11 125 13 106 11 117 21 102 11 113 39 75 10 85 34 115 10 125 27 115 10 125 4 104 9 113 2 76 9 85 24 123 8 131 1 66 8 74 38 80 8 88 41 36 7 43 10 111 7 118 33 110 7 117 0 111 7 118 14 121 6 127 15 114 5 119 40 53 4 57 42 8 0 8 43 3 0 3 ------------------------------------------------------------------------------------------------------------------------
Experience does not have much influence on the personal loan
Income vs Personal Loan
stacked_barplot(loan_data, "Income", "Personal_Loan")
Personal_Loan 0 1 All Income All 4520 480 5000 130 8 11 19 182 2 11 13 158 8 10 18 135 8 10 18 ... ... ... ... 41 82 0 82 40 78 0 78 39 81 0 81 38 84 0 84 8 23 0 23 [163 rows x 3 columns] ------------------------------------------------------------------------------------------------------------------------
The higher the income, the more likely customers are to take a personal loan
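One way to quantify this is to bin income and compare the loan-purchase rate per band; on the real data the same pattern would use `loan_data["Income"]`. A sketch on synthetic data (the purchase probabilities are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.uniform(8, 224, 1000)
# Purchase probability rises with income (illustrative only)
loan = (rng.random(1000) < np.clip((income - 60) / 300, 0, 1)).astype(int)
df = pd.DataFrame({"Income": income, "Personal_Loan": loan})

# Loan-purchase rate within each income band
df["IncomeBand"] = pd.cut(df["Income"], bins=[0, 50, 100, 150, 225])
print(df.groupby("IncomeBand", observed=True)["Personal_Loan"].mean())
```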
Family vs Personal Loan
stacked_barplot(loan_data, "Family", "Personal_Loan")
Personal_Loan 0 1 All Family All 4520 480 5000 4 1088 134 1222 3 877 133 1010 1 1365 107 1472 2 1190 106 1296 ------------------------------------------------------------------------------------------------------------------------
As family size increases, loan purchases tend to increase
CCAvg vs Personal Loan
stacked_barplot(loan_data, "CCAvg", "Personal_Loan")
Personal_Loan 0 1 All CCAvg All 4520 480 5000 3.0 34 19 53 4.1 9 13 22 3.4 26 13 39 3.1 8 12 20 ... ... ... ... 1.67 18 0 18 1.75 9 0 9 7.8 9 0 9 7.6 9 0 9 2.33 18 0 18 [109 rows x 3 columns] ------------------------------------------------------------------------------------------------------------------------
As average credit card spending (CCAvg) increases, loan purchases also increase
Education vs Personal Loan
stacked_barplot(loan_data, "Education", "Personal_Loan")
Personal_Loan 0 1 All Education All 4520 480 5000 3 1296 205 1501 2 1221 182 1403 1 2003 93 2096 ------------------------------------------------------------------------------------------------------------------------
Mortgage vs Personal Loan
stacked_barplot(loan_data, "Mortgage", "Personal_Loan")
Personal_Loan 0 1 All Mortgage All 4520 480 5000 0 3150 312 3462 301 0 5 5 342 1 3 4 282 0 3 3 ... ... ... ... 276 2 0 2 156 5 0 5 278 1 0 1 280 2 0 2 248 3 0 3 [348 rows x 3 columns] ------------------------------------------------------------------------------------------------------------------------
Securities_Account vs Personal Loan
stacked_barplot(loan_data, "Securities_Account", "Personal_Loan")
Personal_Loan 0 1 All Securities_Account All 4520 480 5000 0 4058 420 4478 1 462 60 522 ------------------------------------------------------------------------------------------------------------------------
CD_Account vs Personal Loan
stacked_barplot(loan_data, "CD_Account", "Personal_Loan")
Personal_Loan 0 1 All CD_Account All 4520 480 5000 0 4358 340 4698 1 162 140 302 ------------------------------------------------------------------------------------------------------------------------
Online vs Personal Loan
stacked_barplot(loan_data, "Online", "Personal_Loan")
Personal_Loan 0 1 All Online All 4520 480 5000 1 2693 291 2984 0 1827 189 2016 ------------------------------------------------------------------------------------------------------------------------
CreditCard vs Personal Loan
stacked_barplot(loan_data, "CreditCard", "Personal_Loan")
Personal_Loan 0 1 All CreditCard All 4520 480 5000 0 3193 337 3530 1 1327 143 1470 ------------------------------------------------------------------------------------------------------------------------
X = loan_data.drop(["Personal_Loan"], axis=1)
y = loan_data["Personal_Loan"]
loan_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 5000 non-null int64 1 Experience 5000 non-null int64 2 Income 5000 non-null int64 3 Family 5000 non-null int64 4 CCAvg 5000 non-null float64 5 Education 5000 non-null int64 6 Mortgage 5000 non-null int64 7 Personal_Loan 5000 non-null int64 8 Securities_Account 5000 non-null int64 9 CD_Account 5000 non-null int64 10 Online 5000 non-null int64 11 CreditCard 5000 non-null int64 12 County_Alameda County 5000 non-null uint8 13 County_Butte County 5000 non-null uint8 14 County_Contra Costa County 5000 non-null uint8 15 County_El Dorado County 5000 non-null uint8 16 County_Fresno County 5000 non-null uint8 17 County_Humboldt County 5000 non-null uint8 18 County_Imperial County 5000 non-null uint8 19 County_Kern County 5000 non-null uint8 20 County_Lake County 5000 non-null uint8 21 County_Los Angeles County 5000 non-null uint8 22 County_Marin County 5000 non-null uint8 23 County_Mendocino County 5000 non-null uint8 24 County_Merced County 5000 non-null uint8 25 County_Monterey County 5000 non-null uint8 26 County_Napa County 5000 non-null uint8 27 County_Orange County 5000 non-null uint8 28 County_Placer County 5000 non-null uint8 29 County_Riverside County 5000 non-null uint8 30 County_Sacramento County 5000 non-null uint8 31 County_San Benito County 5000 non-null uint8 32 County_San Bernardino County 5000 non-null uint8 33 County_San Diego County 5000 non-null uint8 34 County_San Francisco County 5000 non-null uint8 35 County_San Joaquin County 5000 non-null uint8 36 County_San Luis Obispo County 5000 non-null uint8 37 County_San Mateo County 5000 non-null uint8 38 County_Santa Barbara County 5000 non-null uint8 39 County_Santa Clara County 5000 non-null uint8 40 County_Santa Cruz County 5000 non-null uint8 41 County_Shasta County 5000 non-null uint8 42 County_Siskiyou County 5000 non-null uint8 43 
County_Solano County 5000 non-null uint8 44 County_Sonoma County 5000 non-null uint8 45 County_Stanislaus County 5000 non-null uint8 46 County_Trinity County 5000 non-null uint8 47 County_Tuolumne County 5000 non-null uint8 48 County_Ventura County 5000 non-null uint8 49 County_Yolo County 5000 non-null uint8 dtypes: float64(1), int64(11), uint8(38) memory usage: 654.4 KB
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.35, random_state=1)
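Note: with only ~9.6% positive cases, passing `stratify=y` to `train_test_split` would keep the class ratio identical in both splits; the notebook relies on random sampling instead. A sketch of the stratified alternative on a toy target with the same imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target mirroring the ~9.6% positive rate of Personal_Loan
y_toy = np.array([1] * 96 + [0] * 904)
X_toy = np.arange(1000).reshape(-1, 1)

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.35, random_state=1, stratify=y_toy
)
# Both splits retain (almost exactly) the original positive rate
print(ytr.mean(), yte.mean())
```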
Build Decision Tree Model
model = DecisionTreeClassifier(criterion='gini',random_state=1)
model.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
print("Accuracy on training set : ",model.score(X_train, y_train))
print("Accuracy on test set : ",model.score(X_test, y_test))
Accuracy on training set : 1.0 Accuracy on test set : 0.9771428571428571
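Perfect training accuracy is the signature of a fully grown, overfit tree. Limiting `max_depth` is one quick way to check that the train/test gap closes; a sketch on synthetic data (not the tuned model, which would normally come from the GridSearchCV imported earlier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_s, y_s = make_classification(n_samples=1000, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_s, y_s, test_size=0.35, random_state=1)

# Fully grown tree memorizes the training set
full = DecisionTreeClassifier(random_state=1).fit(Xtr, ytr)
# Depth-limited tree trades training fit for generalization
pruned = DecisionTreeClassifier(max_depth=4, random_state=1).fit(Xtr, ytr)

print(full.score(Xtr, ytr), full.score(Xte, yte))
print(pruned.score(Xtr, ytr), pruned.score(Xte, yte))
```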
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'County_Alameda County', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Ventura County', 'County_Yolo County']
plt.figure(figsize=(20, 20))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Checking model performance on training set
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(model, X_train, y_train)
Checking performance on test set
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.977143 | 0.872093 | 0.892857 | 0.882353 |
confusion_matrix_sklearn(model, X_test, y_test)
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
l_model = LogisticRegression(solver='liblinear')
l_model.fit(X_train, y_train)
y_predict = l_model.predict(X_test)
coef_df = pd.DataFrame(l_model.coef_)
coef_df['intercept'] = l_model.intercept_
print(coef_df)
0 1 2 3 4 5 6 \
0 -0.402665 0.3999 0.049847 0.655399 0.151515 1.645989 0.000612
7 8 9 ... 40 41 42 43 \
0 -0.914399 3.294787 -0.583168 ... -0.214918 -0.048406 0.285591 0.522395
44 45 46 47 48 intercept
0 -0.261853 -0.099204 -0.225719 0.078178 -0.448303 -2.316842
[1 rows x 50 columns]
model_score = l_model.score(X_test,y_test)
print(model_score)
0.948
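With roughly 9-10% of customers accepting the loan, the 0.948 accuracy above says little on its own (always predicting "no loan" would already score about 0.90). Recall and precision on the positive class are more informative; a hedged sketch on synthetic imbalanced data standing in for the real test split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~10% positives, as in the loan problem
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

lr = LogisticRegression(solver="liblinear").fit(Xtr, ytr)
pred = lr.predict(Xte)

# Positive-class metrics: how many buyers we catch, and how clean the hits are
print("recall   :", round(recall_score(yte, pred), 3))
print("precision:", round(precision_score(yte, pred), 3))
```

The same `recall_score`/`precision_score` calls could be applied to `y_test` and `y_predict` above to put the logistic model on the same footing as the trees.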
cm = metrics.confusion_matrix(y_test,y_predict,labels=[1,0])
df_cm = pd.DataFrame(
    cm,
    index=["1", "0"],
    columns=["Predict 1", "Predict 0"],
)
plt.figure(figsize=(7,5))
sns.heatmap(df_cm, annot=True)
Visualizing the Decision Tree
plt.figure(figsize=(20,30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
Text Report showing the rules used by the decision tree
print(tree.export_text(model, feature_names=feature_names,show_weights=True))
|--- Income <= 110.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2366.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family <= 3.50 | | | | |--- weights: [13.00, 0.00] class: 0 | | | |--- Family > 3.50 | | | | |--- Age <= 38.50 | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | |--- Age > 38.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 92.50 | | | | |--- County_Riverside County <= 0.50 | | | | | |--- Experience <= 0.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Experience > 0.50 | | | | | | |--- County_San Francisco County <= 0.50 | | | | | | | |--- Age <= 62.50 | | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | | |--- Age <= 36.50 | | | | | | | | | | |--- Education <= 1.50 | | | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | | | | |--- Education > 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- Age > 36.50 | | | | | | | | | | |--- County_Alameda County <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- County_Alameda County > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- CCAvg > 3.75 | | | | | | | | | |--- weights: [48.00, 0.00] class: 0 | | | | | | | |--- Age > 62.50 | | | | | | | | |--- County_San Diego County <= 0.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- County_San Diego County > 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- County_San Francisco County > 0.50 | | | | | | | |--- Income <= 82.50 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | |--- Income > 82.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- County_Riverside County > 0.50 | | | | | |--- Age <= 40.00 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- Age > 40.00 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- Income > 92.50 
| | | | |--- Education <= 1.50 | | | | | |--- Online <= 0.50 | | | | | | |--- Income <= 102.00 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Income > 102.00 | | | | | | | |--- Family <= 2.50 | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- Family > 2.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Online > 0.50 | | | | | | |--- weights: [15.00, 0.00] class: 0 | | | | |--- Education > 1.50 | | | | | |--- Mortgage <= 172.00 | | | | | | |--- Experience <= 37.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- Experience > 37.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Mortgage > 172.00 | | | | | | |--- CCAvg <= 4.20 | | | | | | | |--- Mortgage <= 244.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Mortgage > 244.00 | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- CCAvg > 4.20 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- Family <= 1.50 | | | | |--- Securities_Account <= 0.50 | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | |--- Securities_Account > 0.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- Family > 1.50 | | | | |--- weights: [0.00, 8.00] class: 1 |--- Income > 110.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [384.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Age <= 26.00 | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- Age > 26.00 | | | | |--- Income <= 113.50 | | | | | |--- CD_Account <= 0.50 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- CD_Account > 0.50 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Income > 113.50 | | | | | |--- weights: [0.00, 48.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 3.50 | | | | |--- CCAvg <= 2.80 | | | | | |--- County_Santa Barbara County <= 0.50 | | | | | | |--- CCAvg <= 0.70 | | | | | | | |--- Age <= 55.50 | | | 
| | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- Age > 55.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- CCAvg > 0.70 | | | | | | | |--- CCAvg <= 1.55 | | | | | | | | |--- weights: [11.00, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.55 | | | | | | | | |--- CCAvg <= 1.75 | | | | | | | | | |--- Age <= 51.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- Age > 51.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 1.75 | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | |--- County_Santa Barbara County > 0.50 | | | | | | |--- Mortgage <= 125.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Mortgage > 125.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- CCAvg > 2.80 | | | | | |--- CCAvg <= 3.35 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- CCAvg > 3.35 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- CCAvg > 3.50 | | | | |--- weights: [0.00, 7.00] class: 1 | | |--- Income > 116.50 | | | |--- weights: [0.00, 204.00] class: 1
# Let us check the feature importances to understand which features played a major role in the decisions
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title("Feature Importances")
plt.barh(range(len(indices)),importances[indices],color='violet',align="center")
plt.yticks(range(len(indices)),[feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
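Impurity-based importances like the ones plotted above are computed on the training data and are known to favor features with many split points. A useful cross-check is permutation importance on held-out data; a sketch with synthetic stand-in data (the feature counts and sample sizes here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, random_state=1
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(Xtr, ytr)

# Shuffle each feature on the test set and measure the score drop
result = permutation_importance(clf, Xte, yte, n_repeats=10, random_state=1)
print(result.importances_mean.round(3))
```

Features whose shuffling barely moves the test score can be treated as unimportant even if the impurity-based ranking assigns them some weight.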
Using GridSearchCV for hyperparameter tuning of our tree model
Let us see if we can improve the model's performance even further using pre-pruning
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1)
# Choose parameters
parameters = {
    "max_depth": np.arange(2, 50),
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
#Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
#Run the grid search
grid_obj = GridSearchCV(estimator, parameters,scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
#set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
#Fit the best algorithm to the data
estimator.fit(X_train,y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=4,
                       min_impurity_decrease=1e-06, random_state=1)
Checking performance on training set with the pre-pruned tree
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator,X_train,y_train)
decision_tree_tune_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.988923 | 0.922078 | 0.959459 | 0.940397 |
confusion_matrix_sklearn(estimator,X_train,y_train)
Checking model performance on test set
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test,y_test)
decision_tree_tune_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982286 | 0.872093 | 0.943396 | 0.906344 |
confusion_matrix_sklearn(estimator,X_test, y_test)
plt.figure(figsize=(15,12))
tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
plt.show()
print(tree.export_text(estimator, feature_names=feature_names,show_weights=True))
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [2258.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- weights: [72.00, 10.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [39.00, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- weights: [0.00, 4.00] class: 1 |--- Income > 92.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 103.50 | | | | |--- weights: [42.00, 3.00] class: 0 | | | |--- Income > 103.50 | | | | |--- weights: [404.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- weights: [15.00, 5.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 48.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.85 | | | | |--- weights: [100.00, 6.00] class: 0 | | | |--- CCAvg > 2.85 | | | | |--- weights: [12.00, 28.00] class: 1 | | |--- Income > 116.50 | | | |--- weights: [0.00, 204.00] class: 1
We will check the important features used in building the tree
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
Imp Income 0.598064 Education 0.172599 Family 0.134378 CCAvg 0.084803 CD_Account 0.010155 Age 0.000000 County_Santa Clara County 0.000000 County_Sacramento County 0.000000 County_San Benito County 0.000000 County_San Bernardino County 0.000000 County_San Diego County 0.000000 County_San Francisco County 0.000000 County_San Joaquin County 0.000000 County_San Luis Obispo County 0.000000 County_San Mateo County 0.000000 County_Santa Barbara County 0.000000 County_Siskiyou County 0.000000 County_Santa Cruz County 0.000000 County_Shasta County 0.000000 County_Placer County 0.000000 County_Solano County 0.000000 County_Sonoma County 0.000000 County_Stanislaus County 0.000000 County_Trinity County 0.000000 County_Tuolumne County 0.000000 County_Ventura County 0.000000 County_Riverside County 0.000000 County_Monterey County 0.000000 County_Orange County 0.000000 County_Fresno County 0.000000 Mortgage 0.000000 Securities_Account 0.000000 Online 0.000000 CreditCard 0.000000 County_Alameda County 0.000000 County_Butte County 0.000000 County_Contra Costa County 0.000000 County_El Dorado County 0.000000 County_Humboldt County 0.000000 County_Napa County 0.000000 County_Imperial County 0.000000 County_Kern County 0.000000 County_Lake County 0.000000 County_Los Angeles County 0.000000 County_Marin County 0.000000 County_Mendocino County 0.000000 County_Merced County 0.000000 Experience 0.000000 County_Yolo County 0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8,12))
plt.title("Feature Importances on Pre pruned Tree")
plt.barh(range(len(indices)),importances[indices],color='violet',align="center" )
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Creating the model where we get the highest train and test recall
clf = DecisionTreeClassifier(random_state=1)
path= clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas,path.impurities
pd.DataFrame(path)
|   | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000201 | 0.000602 |
| 2 | 0.000287 | 0.001176 |
| 3 | 0.000290 | 0.001756 |
| 4 | 0.000293 | 0.003806 |
| 5 | 0.000293 | 0.005564 |
| 6 | 0.000302 | 0.006167 |
| 7 | 0.000375 | 0.007291 |
| 8 | 0.000410 | 0.008521 |
| 9 | 0.000410 | 0.008931 |
| 10 | 0.000445 | 0.009821 |
| 11 | 0.000462 | 0.010282 |
| 12 | 0.000550 | 0.011932 |
| 13 | 0.000580 | 0.012512 |
| 14 | 0.000646 | 0.013158 |
| 15 | 0.000692 | 0.013850 |
| 16 | 0.000710 | 0.015270 |
| 17 | 0.000921 | 0.017113 |
| 18 | 0.002288 | 0.021689 |
| 19 | 0.002421 | 0.024110 |
| 20 | 0.002686 | 0.026796 |
| 21 | 0.004554 | 0.031350 |
| 22 | 0.010767 | 0.042117 |
| 23 | 0.026057 | 0.068174 |
| 24 | 0.051701 | 0.171576 |
fig,ax = plt.subplots(figsize=(15,5))
ax.plot(ccp_alphas[:-1],impurities[:-1],marker='o',drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total impurity vs effective alpha for training set")
Text(0.5, 1.0, 'Total impurity vs effective alpha for training set')
Next, we will train decision trees using the effective alphas. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving clfs[-1] with only one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is 1 with ccp_alpha: 0.05170099732449929
We will remove the last element in clfs and ccp_alphas, as it is a trivial tree with only one node.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2,1, figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts,marker='o',drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas,depth, marker='o',drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing datasets")
ax.plot(ccp_alphas,recall_train,label='train', marker='o',drawstyle="steps-post")
ax.plot(ccp_alphas,recall_test,label='test',marker='o',drawstyle="steps-post")
ax.legend()
plt.show()
# Create the model where we get the highest test and train recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0009214046822742476, random_state=1)
decision_tree_postpruned_perf_train = model_performance_classification_sklearn(best_model,X_train,y_train)
decision_tree_postpruned_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.989846 | 0.931818 | 0.959866 | 0.945634 |
confusion_matrix_sklearn(best_model,X_train,y_train)
Checking model performance on test set
decision_tree_postpruned_perf_test = model_performance_classification_sklearn(best_model,X_test,y_test)
decision_tree_postpruned_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.981714 | 0.883721 | 0.926829 | 0.904762 |
confusion_matrix_sklearn(best_model,X_test,y_test)
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of decision tree
print(tree.export_text(best_model, feature_names=feature_names,show_weights=True))
|--- Income <= 110.50 | |--- CCAvg <= 2.95 | | |--- weights: [2383.00, 2.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 92.50 | | | | |--- weights: [111.00, 10.00] class: 0 | | | |--- Income > 92.50 | | | | |--- Education <= 1.50 | | | | | |--- weights: [21.00, 2.00] class: 0 | | | | |--- Education > 1.50 | | | | | |--- weights: [7.00, 16.00] class: 1 | | |--- CD_Account > 0.50 | | | |--- weights: [3.00, 10.00] class: 1 |--- Income > 110.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [384.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [2.00, 50.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 3.50 | | | | |--- weights: [31.00, 7.00] class: 0 | | | |--- CCAvg > 3.50 | | | | |--- weights: [0.00, 7.00] class: 1 | | |--- Income > 116.50 | | | |--- weights: [0.00, 204.00] class: 1
# Important features in the tree building
print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
Imp Income 0.386287 Education 0.382473 Family 0.168696 CCAvg 0.045155 CD_Account 0.017389 Age 0.000000 County_Santa Clara County 0.000000 County_Sacramento County 0.000000 County_San Benito County 0.000000 County_San Bernardino County 0.000000 County_San Diego County 0.000000 County_San Francisco County 0.000000 County_San Joaquin County 0.000000 County_San Luis Obispo County 0.000000 County_San Mateo County 0.000000 County_Santa Barbara County 0.000000 County_Siskiyou County 0.000000 County_Santa Cruz County 0.000000 County_Shasta County 0.000000 County_Placer County 0.000000 County_Solano County 0.000000 County_Sonoma County 0.000000 County_Stanislaus County 0.000000 County_Trinity County 0.000000 County_Tuolumne County 0.000000 County_Ventura County 0.000000 County_Riverside County 0.000000 County_Monterey County 0.000000 County_Orange County 0.000000 County_Fresno County 0.000000 Mortgage 0.000000 Securities_Account 0.000000 Online 0.000000 CreditCard 0.000000 County_Alameda County 0.000000 County_Butte County 0.000000 County_Contra Costa County 0.000000 County_El Dorado County 0.000000 County_Humboldt County 0.000000 County_Napa County 0.000000 County_Imperial County 0.000000 County_Kern County 0.000000 County_Lake County 0.000000 County_Los Angeles County 0.000000 County_Marin County 0.000000 County_Mendocino County 0.000000 County_Merced County 0.000000 Experience 0.000000 County_Yolo County 0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_postpruned_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training Performance Comparison")
models_train_comp_df
Training Performance Comparison
|   | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 1.0 | 0.988923 | 0.989846 |
| Recall | 1.0 | 0.922078 | 0.931818 |
| Precision | 1.0 | 0.959459 | 0.959866 |
| F1 | 1.0 | 0.940397 | 0.945634 |
Post-pruning gives the best performance on every metric, although the difference between the pre-pruned and post-pruned trees is very small.
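The same comparison can be assembled for the test set to confirm the conclusion holds out of sample. A sketch of the concat pattern used above, with one-row placeholder frames (values rounded from the test tables earlier) standing in for the three `decision_tree_*_perf_test` DataFrames:

```python
import pandas as pd

# Placeholder one-row metric frames; in the notebook these would be
# decision_tree_perf_test, decision_tree_tune_perf_test,
# and decision_tree_postpruned_perf_test
base = pd.DataFrame({"Accuracy": [0.977], "Recall": [0.872]})
pre = pd.DataFrame({"Accuracy": [0.982], "Recall": [0.872]})
post = pd.DataFrame({"Accuracy": [0.982], "Recall": [0.884]})

# Transpose each frame so metrics become rows, then place models side by side
models_test_comp_df = pd.concat([base.T, pre.T, post.T], axis=1)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print(models_test_comp_df)
```

On the rounded test numbers, the post-pruned tree has the highest recall, which matches the training-set conclusion.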